Modeling Shape, Appearance and Self-Occlusions
for Articulated Object Tracking 

Yanchao Yang and Ganesh Sundaramoorthi 
Abstract 

We present a method for object tracking so that precise object shape can be obtained. Unlike previous tracking
methods that build on image segmentation (by separating foreground and background statistics), which are limited to
uncluttered background and simple appearance of the object, and tracking by detection approaches that have complex 
appearance models but simplistic models of shape, we track by identifying stationary statistics of both appearance and 
shape over time, and therefore obtain accurate shape in complex changing backgrounds. Our method is an adaptive 
template matching scheme that applies to objects with large deformations and articulation as well as a camera that 
moves and changes viewpoint relative to the object. The significant aspect that must be addressed in such a template 
matching scheme is that the shape of the projected object into the imaging plane is quickly changing from the 
complex image induced transformation due to 3D articulation and deformation, viewpoint change, occlusions and 
disocclusions of the object due to viewpoint change, and self-occlusions and self-disocclusions due to articulation. 
We provide the fundamentals to understand occlusions and disocclusions of the object, a model of the shape and 
appearance of the projected object in time taking into account occlusion phenomena, and a computational algorithm 
to obtain the precise shape of the object. We illustrate the ideas on challenging video sequences and obtain state- 
of-the-art results. 

Index Terms 

Object tracking, occlusion detection, disocclusion detection, shape models, appearance models, non-rigid shape 
registration, large deformation registration, optical flow 

I. Introduction 

A key problem in computer vision is that of object tracking from video. In many applications, e.g., entertainment 
(post-production of motion pictures, including 3D video [ ]), and 3D reconstruction, it is important to obtain precise 
cutouts of the object of interest (i.e., determine precise shape) in every frame of the video. Although several methods 
have been proposed, full automation is far from reality, and many fundamental challenges remain to be resolved. 
Resolving these fundamental challenges is not only important for the applications, but also for shedding light on 
building other computer vision systems. The fundamental challenge of the object tracking problem (as well as other
vision problems such as recognition) is that the same 3D object can appear in infinitely many variations projected
in the camera imaging plane. These variations arise from both the intrinsic state of the 3D object, that is, the
pose of the object due to its motion, deformation, and articulation (e.g., the changing pose of a human walking),
and extrinsic factors due to the ambient environment (clutter) and the image formation process. Extrinsic factors
include: movement, i.e., viewpoint, of the camera, occlusion (both from other objects and itself, i.e., self-occlusion),
illumination condition of the ambient environment, quantization level of the camera in both space and time, and 
finally noise generated by the camera. 

Many segmentation techniques, most popularly active contours [ ], have been applied to the object tracking
problem (when precise shape is desired) because of their ability to represent a wide variety of shapes that can
arise when the 3D object is projected into the plane. Most of these methods apply segmentation to detect the object
frame-by-frame and achieve temporal coherence simply by using the final segmentation from the previous frame to
initialize the segmentation in the next frame, or simple dynamics of position, i.e., a finite-dimensional group
(e.g., [ ], [ ], [5], [ ]), or even dynamics of shape [7], a finite-dimensional representation formed from training
shapes, or even more general shape dynamics [ ] to attain a better initialization. However, the performance of
these methods is limited in complex environments arising from natural scenes, since such methods are based on
segmentation by separating the image into foreground and background using simple primitives such as color, texture
and edges; in complex scenes, this leads to over- or under-segmentation of the object. In other words, the model of
the appearance of the object is too simplistic (or is even non-existent, as most methods of this type discriminate
foreground and background statistics). On the other hand, powerful methods from the object detection literature,
which have more sophisticated object-specific models of appearance learned from training data, have been applied
to object tracking [9], [10]. However, these methods have limited models of shape (e.g., boxes), and are therefore
unable to offer the precise cutouts required for several key applications.

In this work, we create realistic joint models of the projected object's shape^1 and appearance in order to track
an object and obtain accurate cutouts. Our goal is to track an object by matching a template (both its shape and 
appearance) to the current image in an online tracking scheme; the template can be obtained, for example, from 
a manual cut out in the first frame. Our technique is motivated by the remarkable property of the human visual 
system, which is able to track with one or very few training samples even under significant viewpoint change, 
3D deformation, and occlusion. Our technique is based on the principle that the object of interest possesses some 
stationary statistics over time, and therefore, the object should be tracked by identifying the statistics in the image 
that are similar to the template, which would imply stationarity, rather than tracking the object by separating 
foreground and background as in segmentation approaches. Since the background in many applications of interest 
may change rapidly (and therefore, is not stationary), it is preferable not to model the background, which would 
be rather complex, and a waste of computations since only the object is of interest. However, as the 3D object 
moves and deforms and as the camera moves, the template, over time, becomes an inaccurate model of the object 
in the current frame. This is the case, because 3D deformation of the object and viewpoint changes of the camera 
lead to complex transformations in the imaging plane of the object as the object is not flat, and self-occlusions 
and self-disocclusions of the object by articulation and occlusion / disocclusion from viewpoint changes lead to 
discontinuities in the transformation and also parts of the projected object come in view and go out of view. Our 
objective in this work is to be able to explicitly model the aforementioned types of occlusions, detect them, and 
be able to adaptively update the template so that the template remains an accurate representation of the object in 
the current frame (both in shape and appearance). This is done by constructing an (infinite-dimensional) dynamic
system of both shape and appearance, and deriving a recursive estimation strategy in order to determine the object
precisely in all frames.

A. Related Work 

Many tracking techniques extend segmentation techniques frame-to-frame for obtaining cut-outs in video. Since 
fully automatic segmentation is difficult, many interactive segmentation methods (e.g., [],[]) have been 
designed where a user can input a key region and interactively correct the segmentation. Interactive techniques 
for segmentation have been extended to video recently (e.g., [17], [18], [19], [20]). In [ ], the method is based 
on foreground and background separation in the entire volume created by the image sequence. As foreground / 
background separation leads to over/under segmentation in cluttered situations, [19] instead uses localized classifiers
for separating foreground/background in a neighborhood of an initial contour. In [20], motion information from
optical flow is integrated to incorporate temporal consistency. While the methods work well in many cases with 
interaction, none explicitly model self-occlusions, which we show is essential for a template matching scheme, and 
leads to state-of-the-art results. 

There has been much work on modeling the dynamics of the motion of the object projected in the imaging plane 
(e.g., [4], [6], [5], [21]) to improve segmentation techniques, most notably active contour techniques (e.g., [22], 
[23], [24], [25], [26], [27], [28], [29]). These techniques aim at better initialization for the segmentation (rather than 
the segmentation from the previous frame) by using the dynamics to predict the object in the next frame. Modeling 

^1 In this work, we define the shape of the projected object to be any closed, compact subset of $\mathbb{R}^2$ (with non-zero measure) whose boundary
is sufficiently smooth as in [ ]. This includes regions with any number of holes. Further, unlike much of the literature in shape analysis
stemming from [12], e.g., [13], [14], we do not define shape modulo a finite-dimensional group (such as the Euclidean group) since, in tracking,
pose is important.
dynamics of not just motion (a finite-dimensional group), but also dynamics of (infinite-dimensional) shape has
been accomplished in [ ] to make more accurate predictions of the object in the next frame. In [ ], dynamics of
shape is modeled through training data. Training data is difficult to obtain (and a tremendous amount of it would
be required) for the types of video that are of interest in this paper (i.e., near-field video^2), where there
is significant variability that must be accurately captured (e.g., hair, clothing, and accessories). Some robustness 
to minor occlusion from other objects is achieved in these works since filtering is done to combine the results of 
prediction with the results of segmentation (and weighting the prediction high in comparison to the segmentation 
allows some robustness to occlusion). However, the fundamental limitation of these methods is that they ultimately 
rely on segmentation driven by foreground / background separation, which in cases of clutter leads to over or under 
segmentation. 

Rather than building on segmentation to do tracking, appearance models of the object have been employed in 
tracking as statistics of the appearance of the object are roughly stationary. For example, [30], [31] and more 
recently, [10], [9], [31] have employed appearance models to track an object. However, the obvious limitation of
these approaches is the inability to capture precise shape as they are box trackers, and the training requirement
may be excessive for near-field video. Attempts have been made to track with both an appearance model and a
more accurate representation of shape; for example, in [33], the object is tracked by deforming an initial contour
so that its intensity histogram in the region defined by the contour matches a prior intensity distribution. In this 
approach, however, the appearance model is considerably less powerful as spatial relations between pixel intensities 
are removed when considering histograms. 

Active appearance models [ ] can model both shape and appearance of the object; however, shapes are represented
by parametric models that are formed from training data, and this limits the shape of the object. Non-parametric
models of appearance and shape for matching are given by the Large Deformation Diffeomorphic Metric Mapping
(LDDMM) framework, typically used in medical imaging [ ], [36] (also related is [ ]), where the shape and
appearance of a template are deformed via a geodesic flow to match a target image. However, the smoothness restriction
on the deformation is not realistic for objects projected in the (2D) imaging plane, as self-occlusions, disocclusions,
and occlusions due to viewpoint change result in maps that are not smooth or even non-existent at some locations. Thus,
these methods would not be applicable to the tracking problem. The Deformotion framework [38], [39] also deals
with joint models of appearance and shape, but for both foreground and background. Moreover, occlusions and 
discontinuities in the map due to self-occlusions are not considered. 

The problem of occlusions has been considered in optical flow problems [40] recently and in the past. In [41], 
the forward and backward optical flows are computed, and the occluded region is defined as the region in the 
images where the composition of the forward and backward map does not yield the identity map. A similar idea 
for stereo disparity is used in [42]. In [43] and [44], occlusions are considered to be regions where the optical 
flow residual is large. In [45], occlusion boundaries are detected by discontinuities of motion estimates within
a small ball centered around the pixel of interest. In [46], [47], [48], joint estimation of the optical flow and 
occlusion is performed. An occlusion is considered a region in image 1 that does not correspond to image 2. Our 
work generalizes such an approach to large deformations (that optical flow cannot capture) and object tracking, 
where additional considerations must be made as the detected occlusions in [ ], [ ] do not distinguish between 
occlusions/disocclusions of the object of interest, another object or the background, which we shall see is vital for 
object tracking. Recently, [ ] have considered optical flow and occlusion detection for large deformations. Our 
work goes beyond this by distinguishing occlusions and disocclusions, modeling the whole tracking process (not 
just optical flow) via a dynamic system that models both appearance (so that deviations from brightness constancy 
can be modeled) and shape, and further our method is not restricted to finite dimensional parametric models of 
velocity. 

II. Mathematical Model 

In this section, our goal is to formalize the problem of determining the precise shape of the projected object as 
an inference problem given an assumed mathematical model representing how the projected object changes in time. 
We derive a realistic dynamic model of the projected object that is obtained from the image formation process, 

^2 We use the term near-field video to indicate video where objects are rather close to the camera, and thus shape changes and self-occlusions
are significant. 
and movement of the camera and object. From the dynamic model, the notion of occlusions and disocclusions will
become clear, facilitating their computation in future sections. The dynamical model is also a necessary ingredient 
for the recursive estimation strategy that we employ in the next sections. 

Let $\Omega \subset \mathbb{R}^2$ denote the domain of the image sequence, and $I : \{1, 2, \ldots, K\} \times \Omega \to \mathbb{R}^k$ denote the image
sequence (sampled at $K$ instances of time) that has $k$ channels. We denote by $I_t$ the image sampled at time $t$.
Suppose that $S_t \subset \mathbb{R}^3$ is the surface of the object (in three dimensions) that deforms in time, and that $\rho_t : S_t \to \mathbb{R}^k$
is the reflectance of the object. The surface is assumed to be smooth, compact and without boundary. We assume
the following model:

$$ I_t(x) = \rho_t(\pi_t^{-1}(x)) + \eta_t(x), \quad x \in R_t \subset \Omega \tag{1} $$

where $R_t$ (a compact, closed set of non-zero measure) denotes the points of the surface of the object $S_t$ that project
into the imaging plane at time $t$, $\pi_t^{-1}$ denotes the back-projection of the points in the imaging plane at the viewpoint
of the camera at time $t$ to the surface $S_t$, which moves and deforms in time, and $\eta_t : \Omega \to \mathbb{R}^k$ denotes a noise
process. The model is the Brightness Constancy Assumption [ ] (when $\rho_t = \rho$ and $\eta_t = 0$) that is used in
computer vision; it assumes that the object has Lambertian reflectance. The noise process $\eta_t$ models noise in the
image formation process as well as modeling errors from the Lambertian assumption. Note that we only model the
object appearance and not the background appearance. This is based on the principle that, in tracking an object,
there exist some statistics of the object that are stationary in time, but the background appearance (due to clutter
and camera movement) is less stationary or not stationary at all. Modeling the background would thus require sophisticated
models, and incorporating such models into the algorithm would waste tremendous computation on a background
that does not need to be estimated for the problem of determining the object anyway [ ].

Our goal is to estimate the projected region $R_t$ of the object $S_t$ into the imaging plane, and its appearance
$a_t = \rho_t \circ \pi_t^{-1}$ ($a_t : R_t \to \mathbb{R}^k$), given the (2D) image sequence. While in some applications (e.g., video editing)
one may assume that the whole image sequence is available, for other (real-time) applications such as 3D-TV, the
processing must be done online, and thus we derive a method for the online case. For simplicity in our model, we
will assume that $\rho_t = \rho$^3. We will allow the surface of the object $S_t$ to deform and move in time, and assume that
the camera has its own motion. We assume the following dynamic model for the region (shape) of the object, its
appearance, and its deformation / motion:



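$$ R_t = \left( w_{t-1}(R_{t-1} \setminus O_{t-1}) \cup D_{t-1} \right) \oplus \xi_t \tag{2} $$

$$ a_t(x) = \begin{cases} a_{t-1}(w_{t-1}^{-1}(x)) + \eta_t(x), & x \in w_{t-1}(R_{t-1} \setminus O_{t-1}) \\ a_t^d(x) + \eta_t(x), & x \in D_{t-1} \end{cases} \tag{3} $$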



where $O_t$ denotes the subset of the region $R_t$ that has been occluded from view at time $t$, $D_t$ denotes the subset of
the projected object that has been disoccluded (comes into view) at time $t + 1$, $a_t^d : D_{t-1} \to \mathbb{R}^k$ is the appearance
of the disoccluded region, $w_{t-1} : R_{t-1} \setminus O_{t-1} \to R_t$ is the (dense) correspondence mapping points (that are not
occluded) in $R_{t-1}$ to the region $R_t$ in the next frame, and $\oplus$ denotes an operation that perturbs a set (region in the
plane) via a noise process $\xi_t$. One can also model the evolution of the warp in time analogous to [ ] (as done for
curves, not regions); however, our main focus in this paper is occlusions, shape and appearance models since, as
we will see in the experiments in Section V, these are among the most fundamental properties leading to a significant
performance increase over the state of the art. Figure 1 illustrates all the quantities of the model (2), (3).

The model says that at consecutive times, the co-visible part of the surface $S_t$ (the points on the surface that
project to points in the imaging plane at both times) projected in the imaging plane is related between frames $t$ and
$t + 1$ by a warp $w_t$ that is determined from the viewpoint of the camera, the surface $S_t$, and its deformation. As
the surface of the object will be curved, the camera near the object, and the surface deforming, the map $w_t$ is
non-parametric, and generally a diffeomorphism (when the surface is smooth) in the un-occluded region. We model
only the warp $w_t$ in two dimensions as we have only images that are defined in 2D. A subset of $R_{t-1}$, that is,
$R_{t-1} \setminus O_{t-1}$, is warped by $w_{t-1}$ (the induced warping in the plane) and then the portion of the surface appearing
in view at time $t$ that is not in view at $t - 1$, $D_{t-1}$, is appended to the warped region to form $R_t$ (after a small
perturbation by $\xi_t$ to model errors in the process). Note that occlusions and disocclusions in tracking (even when

^3 Technically, since $\rho : S_t \to \mathbb{R}^k$ depends on time (as $S_t$ depends on time), $\rho$ will also be a function of time; the
expression $\rho_t = \rho$ means that corresponding points on the deforming surface have the same reflectance.
Fig. 1: This figure shows a diagram illustrating our model (left: template $(R_{t-1}, a_{t-1})$, right: $I_t$). The self-occlusions
$O_{t-1}$ are indicated in black, self-disocclusions $D_{t-1}$ are in red, the appearance of the disoccluded region is $a_t^d$, the
region at time $t - 1$ is $R_{t-1}$ and at time $t$ is $R_t$ (the region inside the green contour), and the warp is $w_{t-1}$, which
is defined in $R_{t-1} \setminus O_{t-1}$. Notice that the curved black line is a self-occlusion since the arm moves towards the left,
and the curved black line represents part of the discontinuity set of $w_{t-1}$.



there are no occlusions of the tracked object by other objects) are relevant because of changes of relative viewpoint
of the object with respect to the camera, and the fact that the object can deform in ways so as to self-occlude or
self-disocclude (e.g., the arm of a person occluding and disoccluding the leg as the person walks). The relevant
portion of the appearance, i.e., $a_{t-1}$ restricted to $R_{t-1} \setminus O_{t-1}$, is transferred via the warp $w_{t-1}$ to $R_t$, noise is added, and then a
newly visible appearance is obtained in the disoccluded region $D_{t-1}$. The model can be thought of as a constant
appearance plus noise model although, since the shape (region) of the object changes in time, and the appearance
is defined on the region, the appearance is necessarily changing (being warped). The model relates to Deformotion
[38], [ ], and generalizes that model to consider (self) occlusions and disocclusions.
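
A minimal, noise-free sketch of one step of the model (2), (3) may help fix ideas. It assumes (this is not from the paper) that regions are stored as sets of integer pixel coordinates, appearances as dictionaries from pixel to intensity, and that the warp is a known function. The names `propagate` and `shift_right` are hypothetical, and both the region perturbation and the appearance noise are omitted.

```python
def propagate(region, appearance, occluded, disoccluded, new_appearance, warp):
    """One noise-free step of the region/appearance model (2), (3).

    region, occluded, disoccluded: sets of (row, col) pixels.
    appearance, new_appearance: dicts mapping pixel -> intensity.
    warp: function mapping an un-occluded pixel at t-1 to its pixel at t.
    """
    visible = region - occluded                  # un-occluded part of the old region
    next_region = set()
    next_appearance = {}
    for x in visible:
        y = warp(x)                              # warped location in the next frame
        next_region.add(y)
        next_appearance[y] = appearance[x]       # appearance is carried along by the warp
    for x in disoccluded:                        # append the newly visible part
        next_region.add(x)
        next_appearance[x] = new_appearance[x]   # appearance of the disoccluded region
    return next_region, next_appearance


def shift_right(x):
    """Toy warp (an assumption of this example): move one pixel to the right."""
    return (x[0], x[1] + 1)


# Toy example: a 1x4 strip moves right by one pixel; its leftmost pixel becomes
# occluded and one new pixel comes into view at the right end.
region = {(0, 0), (0, 1), (0, 2), (0, 3)}
appearance = {(0, c): 10 * c for c in range(4)}
occluded = {(0, 0)}
disoccluded = {(0, 5)}
new_appearance = {(0, 5): 99}

next_region, next_appearance = propagate(
    region, appearance, occluded, disoccluded, new_appearance, shift_right)
```

In this toy case the three un-occluded pixels are carried one column to the right with their intensities unchanged, and the disoccluded pixel is appended with its own appearance.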

Our goal in the rest of this paper is to use the model above to infer the object projected into the plane, $R_t$, and
its appearance $a_t$ given the image sequence $I$. As a by-product, we will also estimate the correspondence $w_t$, the
occlusion $O_t$, and the disocclusion $D_t$. The inference of $R_t$ and $a_t$ from $I_t$ ($t = 1, \ldots, K$) given our model (2), (3)
is ill-posed, and must be solved with realistic prior assumptions. In the next section, we describe the weakest and
smallest set of assumptions so that $R_t$ and $a_t$ can be inferred and estimated.

III. Occlusions and Energy-Based Formulation 

We derive a Luenberger observer [52], [53], albeit non-linear [ ] (as the model (2) and (3) is non-linear), to
recursively estimate the appearance and shape of the object projected into the imaging plane. We use the "hat"
notation to denote estimated quantities. Specifically, $\hat a_{t|t-1}$ and $\hat R_{t|t-1}$ denote the predictions at time $t$ using
the image data (measurements) and previous estimates up to time $t - 1$, and $\hat a_{t|t}$ and $\hat R_{t|t}$ denote the estimates
of the appearance and shape using the measurements up to time $t$, respectively. We assume that we are given
the estimates $\hat a_{t|t}$ and $\hat R_{t|t}$, and we describe the process of obtaining the prediction and estimates at time $t + 1$
when the measurement $I_{t+1}$ is available. In order to determine $\hat R_{t+1|t+1}$ and $\hat a_{t+1|t+1}$, we estimate the warp $w_t$,
the occluded region $O_t$, and the disoccluded region $D_t$ via an optimization problem. We describe next the setup of the
optimization problem and the principles governing its formulation.

In general, determining an occlusion or self-occlusion of the object at time $t$ cannot be done from the image
$I_t$ alone without prior knowledge of the 3D scene, and in a general tracking scenario, 3D scene information is
unavailable. For example, given a scene of a chair and a table (the chair behind the table from the viewpoint of interest),
one only knows there exists an occlusion of the chair from the image at the given viewpoint because there is a prior
notion of the chair. However, the occlusion (self-occlusion and disocclusion) can be partially resolved by observing multiple
images from differing viewpoints in a process of movement of the observer [ ]. In the specific case of a moving
and deforming object, given an estimate of the object projected into the imaging plane at time $t$ and the image at
time $t + 1$, this observation can be readily applied to determine the occluded region $O_t$. In fact, the principle is
that the occluded region $O_t$ is the subset of $R_t$ that does not correspond to a region in $I_{t+1}$. This includes both
occlusions and self-occlusions. The case of disocclusions and self-disocclusions of the object cannot be resolved
given $R_t$ and $I_{t+1}$ without a prior notion of the object. Indeed, if a part of the object projected in the imaging
plane comes into view at time $t + 1$ and it was not observed earlier, then it is impossible to know whether the part
newly in view is part of the object of interest, another object or the background without additional assumptions on
the object. We illustrate a possible set of assumptions in Section III-B. Ideas of occlusions following this line of
thought (only for occlusions, not disocclusions) are also considered in [ ] for robust estimation of optical flow and
occlusion detection in the entire image; we generalize the approach for tracking an object, and moreover, extend it
to distinguish between occlusions / disocclusions of the object. In [ ], it is impossible to tell whether the calculated
"occlusion" is an occlusion or disocclusion of the background, the object or another object, which is vital for object
tracking.
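
The occlusion principle above can be illustrated with a small sketch. It assumes (these are assumptions of the example, not the paper's procedure) that the warp is already known, that images are dictionaries from pixel to intensity, and that "does not correspond" is decided by a hypothetical intensity tolerance `tol`. In the paper the warp and the occlusion are unknown and are estimated jointly.

```python
def detect_occlusion(region, appearance, next_image, warp, tol):
    """Flag pixels of the region that have no counterpart in the next image.

    A pixel is flagged when its warped location falls outside the next image
    or when the intensity found there differs from the template appearance
    by more than tol.
    """
    occluded = set()
    for x in region:
        y = warp(x)
        if y not in next_image or abs(next_image[y] - appearance[x]) > tol:
            occluded.add(x)
    return occluded


def shift_right(x):
    """Toy warp (an assumption of this example): move one pixel to the right."""
    return (x[0], x[1] + 1)


# Template: a 1x4 strip with intensities 10, 20, 30, 40.
region = {(0, c) for c in range(4)}
appearance = {(0, c): 10 * (c + 1) for c in range(4)}

# Next image: the strip has moved right by one pixel, but another object with
# intensity 200 covers the location where the last strip pixel should appear.
next_image = {(0, 0): 0, (0, 1): 10, (0, 2): 20, (0, 3): 30, (0, 4): 200, (0, 5): 0}

occluded = detect_occlusion(region, appearance, next_image, shift_right, tol=5)
```

Only the last pixel of the strip is flagged, since it is the only one whose appearance cannot be found at its warped location. Note that this test alone cannot say what caused the mismatch, which is why additional modeling is needed for tracking.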

We now set up two optimization problems for determining the occlusion $\hat O_{t+1|t+1}$ and the disocclusion $\hat D_{t+1|t+1}$
given $\hat R_{t|t}$, $\hat a_{t|t}$, and the image $I_{t+1}$. In general, these would be determined from both the predictions $\hat R_{t+1|t}$
and $\hat a_{t+1|t}$, and the image measurement $I_{t+1}$ in the form of an optimization problem [ ]. Since we do not have
an explicit dynamic model of the warp, we assume that $\hat R_{t+1|t} = \hat R_{t|t}$; the constant appearance plus noise
assumption implies that the best prediction for the appearance is $\hat a_{t+1|t} = \hat a_{t|t}$ (assuming a white noise process $\eta_t$).

A. Energy Formulation for Self-Occlusions 

Note the dependence in our model of the warp $w_t$ on the occlusion $O_t$: if one of these quantities were known, the
other could be estimated, but both are unknown, which calls for a joint optimization problem (energy
to be minimized):

$$ E_o(O, w) = \int_{\hat R_{t+1|t} \setminus O} \left| \hat a_{t+1|t}(x) - I_{t+1}(w(x)) \right|_W \, dx + \mathrm{Reg}(w) + \mathrm{Reg}(O) \tag{4} $$

where regularization terms are added to deal with well-posedness issues. Note that if no regularization on the
occlusion is given above, the optimization would result in the trivial solution $O = \hat R_{t+1|t}$. Given a moderate frame rate
of the camera device, it is realistic to assume that the occlusion is small in area compared to the object (indeed,
as the frame rate becomes larger, the occluded region becomes a one-dimensional set, e.g., an edge), and further
that the occluded region of the object is spatially continuous. The latter is largely true since the object in 3D is spatially
continuous, and therefore the property is largely preserved under projection into the imaging plane. These two
assumptions lead us to the regularity term

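$$ \mathrm{Reg}(O) = \mathrm{Area}(O) + \mathrm{Length}(\partial O); \tag{5} $$

note that $\mathrm{Length}(\partial O) = \int_\Omega |\nabla 1_O(x)| \, dx$, where $1_O$ is the indicator function of $O$, and thus the term leads to spatial
regularity of $O$.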
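
To see how the regularity term (5) favors compact occlusions, the following sketch evaluates it on a pixel grid. It assumes (as an illustration only) unit pixels, a region stored as a set of pixel coordinates, and a boundary length measured by counting the pixel edges between the region and its complement in the 4-neighborhood, which is a discrete, axis-aligned stand-in for the integral of the gradient magnitude of the indicator function. The function names are hypothetical.

```python
def area(region):
    """Number of pixels in the region."""
    return len(region)


def boundary_length(region):
    """Number of pixel edges separating the region from its complement."""
    length = 0
    for (r, c) in region:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (r + dr, c + dc) not in region:
                length += 1
    return length


def reg_occlusion(region):
    """Discrete version of Area(O) + Length(boundary of O) with unit weights."""
    return area(region) + boundary_length(region)


compact_block = {(r, c) for r in range(2) for c in range(2)}   # one 2x2 block
scattered = {(0, 0), (0, 2), (2, 0), (2, 2)}                   # four isolated pixels

reg_compact = reg_occlusion(compact_block)
reg_scattered = reg_occlusion(scattered)
```

Both candidate occlusions have the same area, but the scattered one has a longer boundary and is therefore penalized more, which is the spatial regularity effect described above.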

Due to the aperture problem, it is necessary to impose regularity on the warp, as several warps can explain the
image $I_{t+1}$ and the given region $\hat R_{t+1|t}$ and appearance $\hat a_{t+1|t}$. Our goal is to handle large, articulated deformations
arising from moderately low frame rate and large motion / deformation of the object compared to the frame rate.
We assume that^4

$$ w(x) = \phi(1, x), \qquad \phi(t, x) = x + \int_0^t v(\tau, \phi(\tau, x)) \, d\tau, \tag{6} $$

i.e., the warp $w$ is formed by adding small increments $v(\tau, \cdot)$ (where $v$ is the infinitesimal motion) to the identity
map to form the warp outside the occluded region. This is reminiscent of the LDDMM registration framework that
is used in medical imaging [ ], except that we generalize it to consider occlusion, a phenomenon that does not
exist in 3D medical data. The regularity term that we impose is spatial regularity of $v$, e.g.,

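$$ \|v(\tau, \cdot)\|^2 = \int_{\phi(\tau, \hat R_{t+1|t} \setminus O)} |\nabla v(\tau, x)|^2 \, dx, \qquad \mathrm{Reg}(w) = \int_0^1 \|v(\tau, \cdot)\|^2 \, d\tau \tag{7} $$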

The norm of the gradient is better suited to articulated motion than a total variation (TV) term,
which has been gaining popularity in optical flow problems [ ], [ ], as TV greatly favors a piecewise constant
solution, whereas to model articulation in near-field video, a finer model than piecewise constant is desired. Note that
discontinuities of the warp $w_t$ due to self-occlusion can still be modeled with the norm, as the warp need not be
smooth across the set $O$ (which includes self-occlusion). The norm in (4) is $|\cdot|_W = W |\cdot|$, where $W > 0$ denotes a spatially
varying weight function (designed to be robust to minor localized illumination change, specularities, and other
disturbances). An example form of $W$ will be given in Section V. An efficient approximate method to minimize
$E_o$ (4) will be described in Section IV. The optimizers of $E_o$ are denoted $O_t$ and $w_t$, and then $R'_{t+1} = w_t(\hat R_{t|t} \setminus O_t)$
is the new region excluding the disoccluded region $D_t$, which we describe how to obtain next.
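
The construction of the warp in (6), as an accumulation of small increments of a velocity field, can be sketched with a forward Euler integration. This is only an illustration under stated assumptions: the velocity is supplied as a known Python function (in the paper it is an unknown to be optimized), the step count `steps` is a hypothetical discretization parameter, and the two example velocity fields are invented for the demonstration.

```python
import math


def integrate_warp(x, velocity, steps=100):
    """Approximate w(x) = phi(1, x), where phi(0, x) = x and
    d/dtau phi(tau, x) = v(tau, phi(tau, x)), by forward Euler steps."""
    px, py = x
    dtau = 1.0 / steps
    for k in range(steps):
        tau = k * dtau
        vx, vy = velocity(tau, (px, py))   # velocity at the current point
        px += dtau * vx                    # small increment added to the map
        py += dtau * vy
    return (px, py)


def translation(tau, p):
    """Constant velocity: every point moves by (3, -1) over unit time."""
    return (3.0, -1.0)


def rotation(tau, p):
    """Rigid rotation about the origin by a quarter turn over unit time."""
    omega = math.pi / 2.0
    return (-omega * p[1], omega * p[0])


moved = integrate_warp((1.0, 2.0), translation)
turned = integrate_warp((1.0, 0.0), rotation, steps=10000)
```

With the constant field the result is the translated point, and with the rotational field the point (1, 0) ends up close to (0, 1). The second case shows why a flow is useful for articulation: a large rotation of a limb is poorly described by a single small displacement but is easily built from many small increments.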

B. Energy Formulation for Self-Disocclusions 

As stated earlier, in order to determine the disoccluded region (the portion of the object that appears in the image
at time $t + 1$), it is necessary to make a prior assumption on the disoccluded region; otherwise there is no way to
determine whether the region is the object or background. Part of our contribution in this work is to note what is
needed in order to determine occlusion / disocclusion, and we illustrate one prior assumption that is quite effective
in many (by no means all) situations for resolving the disocclusion. Our prior assumptions are self-similarity of
the reflectance $\rho$ of the object, closeness of the disocclusion to $\partial R'_{t+1}$, and spatial continuity. The self-similarity
assumption, i.e., that the object albedo is composed of a small number of repeated patches, is realistic (it is
largely true for humans, animals, and many other objects). This follows a line of reasoning akin to Non-Local
Means [ ] and dictionary methods [ ]. We will show how to incorporate these assumptions given the region
$R'_{t+1}$ and the propagated appearance $\hat a_{t+1|t} \circ w_t^{-1}$. These assumptions can be incorporated into the following energy
to be minimized:

$$ E_d(D) = \int_D \left( \frac{1}{p_{B_r(\mathrm{cl}(x))}(I_{t+1}(x), x)} - \frac{1}{T_d} \right) dx + \mathrm{Length}(\partial D) \tag{8} $$

where $D$ is restricted to be a subset of $\Omega \setminus R'_{t+1}$, $\mathrm{cl}(x)$ denotes the closest point of $\partial R'_{t+1}$ to $x$, $B_r(x)$ is a ball of
radius $r$ about the point $x$, $p_{B_r(\mathrm{cl}(x))}(I_{t+1}(x), x)$ denotes the likelihood that a point $x$ belongs to the disoccluded
region, and $T_d > 0$ is a weight to ensure a non-trivial solution (akin to a threshold of $p$). The likelihood can be
chosen in a number of ways; we specifically choose it to have two components: one that measures the fit of $I_{t+1}(x)$
(where $x$ is in the candidate region for the disocclusion; see below for more details) to the local distribution of
$I_{t+1}$ within $B_r(\mathrm{cl}(x)) \cap R'_{t+1}$ versus the background $B_r(\mathrm{cl}(x)) \cap (\Omega \setminus R'_{t+1})$ in $I_{t+1}$, and a second that measures
nearness of $x$ to $R'_{t+1}$. One possible choice of this term is

$$ p_{B_r(\mathrm{cl}(x))}(I_{t+1}(x), x) \propto \exp\left( -\frac{d_{R'_{t+1}}(x)^2}{2\sigma_d^2} + \left( d(I_{t+1}(x), p_{in}) - d(I_{t+1}(x), p_{out}) \right) \right) \tag{9} $$

where $d_{R'_{t+1}}(x)$ indicates the signed Euclidean distance from $x$ to $R'_{t+1}$, $p_{in}$ represents a Parzen estimate of the
intensity distribution of $I_{t+1}$ in $B_r(\mathrm{cl}(x)) \cap R'_{t+1}$, $p_{out}$ represents a Parzen estimate of $I_{t+1}$ inside
$B_r(\mathrm{cl}(x)) \cap (\Omega \setminus \{d_{R'_{t+1}} > \varepsilon\})$ ($\varepsilon$ is chosen large enough so that the region includes the background and not the disocclusion;
a practical choice of parameters will be given in Section V), and $d$ is a measure of closeness of $I_{t+1}(x)$ to the
distributions. A schematic of the defined quantities is shown in Figure 2. The computation of $p_{B_r}$ bears some
resemblance to the localized classifiers used in [ ]; however, there the localized classifiers are propagated from the
previous frame to segment the entire image. In contrast, the computation of $p_{B_r}$, the probability of the disocclusion,
is based on the fact that $R'_{t+1}$ belongs to the object in $I_{t+1}$ and $\{d_{R'_{t+1}} > \varepsilon\}$ does not belong to the object in $I_{t+1}$.
The optimization of $E_d$ seeks only to determine the disoccluded region (from regions that are already known to be
part of the object in $I_{t+1}$), not the entire segmentation of the object.

The constructed energy is based on the idea that, since the reflectance of the 3D object is self-similar,
the projection of the reflectance into the imaging plane, i.e., the appearance, is also self-similar. It is
therefore assumed that the appearance in the disoccluded region $D_t$ is similar to some portion of the appearance
$\hat a_{t+1|t} \circ w_t^{-1}$. This assumption is largely true, for example, in humans, animals, and other objects. It would not
necessarily be true when, for example, the back of the head of a human is seen at time $t$, and then at time
$t + 1$ the front of the face appears. Again, our objective in this paper is to illustrate the concepts of what is needed
in order to resolve the challenging problem of occlusions and disocclusions, and to illustrate one possible set of
assumptions that are effective in applications; we do not argue that these assumptions are always foolproof.

C. Obtaining New Estimates of Shape and Appearance 

Our approach to tracking an object is a robust recursive estimation scheme to determine the appearance and shape 
of the object at every instance of time following the ideas of a Luenberger observer. The previous two energies 

Fig. 2: Schematic of the quantities involved in the computation of the disocclusion energy (8). The region to the left of the dark curved line is the inside of $R'_{t+1}$ (the region before the disocclusion is determined).



described the process of using the measurement $I_{t+1}$ to obtain the measured quantities $O_t$, the occlusion, and $D_t$, the disocclusion. We now describe the process of obtaining estimates $R_{t+1|t+1}$ and $a_{t+1|t+1}$ incorporating both the predictions $R_{t+1|t}$, $a_{t+1|t}$ and the measurement $I_{t+1}$. Recall that the prediction $R_{t+1|t}$ is simply $R_{t|t}$ (due to the lack of a model of the warp evolution in time), and thus the estimate would be some weighted average (with weights obtained from the statistics of the noise process) of $R_{t+1|t}$ and $w_t(R_{t+1|t} \setminus O_t) \cup D_t$. This can be done in a number of ways requiring a shape metric, e.g., [ ], [ ] (though since the region and appearance are coupled, such techniques must be generalized), but for our purpose, we assume mostly full reliability of the measurement $I_{t+1}$ (see footnote), and thus, the estimate is

$$R_{t+1|t+1} = w_t(R_{t+1|t} \setminus O_t) \cup D_t, \qquad (10)$$

which assumes an infinite gain. In order to obtain a robust estimate of the appearance according to the model (3), which models changes in time due to the noise process $\eta$ (e.g., due to illumination change), a correction of the predicted appearance is required:

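$$a_{t+1|t+1}(x) = \begin{cases} (1 - K_a)\, a_{t+1|t} \circ w_t^{-1}(x) + K_a I_{t+1}(x) & x \in w_t(R_{t|t} \setminus O_t) \\ I_{t+1}(x) & x \in D_t \end{cases} \qquad (11)$$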

where $K_a > 0$ is the gain. Theoretically, the gain would be calculated from the statistics of the noise process; however, we do not assume a particular noise process, and in practice the gain is chosen large if the image appearance is reliable (e.g., no specularities, illumination change, noise, or any other deviations from the model), and small otherwise.
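The appearance correction (11) amounts to a per-pixel blend on the warped, un-occluded region and a direct copy of the new image on the disocclusion. A minimal sketch follows, assuming the predicted appearance has already been warped into frame $t+1$ and that both regions are given as boolean masks; the function name `update_appearance` and the convention of leaving pixels outside both regions at zero are illustrative choices, not part of the method.

```python
import numpy as np

def update_appearance(warped_appearance, image, visible_mask,
                      disoccluded_mask, gain=0.7):
    """Blend the warped prediction with the new image as in (11).

    warped_appearance: predicted appearance already warped into frame t+1
    image:             new frame I_{t+1}
    visible_mask:      True on the warped, un-occluded region
    disoccluded_mask:  True on the detected disocclusion D_t
    """
    updated = np.zeros_like(image, dtype=float)
    blend = (1.0 - gain) * warped_appearance + gain * image
    updated[visible_mask] = blend[visible_mask]
    # On the disocclusion there is no prediction, so the image is copied.
    updated[disoccluded_mask] = image[disoccluded_mask]
    return updated

warped = np.full((2, 3), 100.0)
frame = np.full((2, 3), 200.0)
visible = np.array([[True, True, False], [False, False, False]])
disoccluded = np.array([[False, False, True], [False, False, False]])
new_appearance = update_appearance(warped, frame, visible, disoccluded)
```

A gain near 1 trusts the new image almost entirely, while a gain near 0 keeps the predicted template largely unchanged.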



IV. Optimization Method 

In this section, we describe our method to optimize the energies $E_o$ and $E_d$ in (4) and (8), and other implementation-specific issues. We derive a method that consists of thresholding, smoothing, and other simple operations so that the tracking can be implemented easily and efficiently. We choose to represent the estimated region via level set methods [60] for sub-pixel accuracy of the shape, for handling topology changes of the object projected into the plane, and for a number of other conveniences that will become apparent. This is a choice to illustrate the principles of our model, and it is certainly not the only choice.

"^This is mainly for computational purposes, as not assuming full reliability would require an averaging procedure for shape. While this 
is possible, it requires significant machinery, which would distract from the main contribution of our work. Further, the assumption gives 
good results in practice. 
A. Optimization of $E_o$

We derive an approximate method to minimize the energy $E_o$ (4) that takes into account the constraint on the warp (6) and regularity on the velocity field (7) to allow for large deformation. A full optimization of $E_o$ with the constraints (6) and (7) leads to a high-dimensional gradient descent PDE that is expensive and susceptible to local minima; a simplified version of the energy (without occlusions) is considered in [ ], and it leads to a sophisticated set of PDEs. Since speed is of utmost importance in object tracking, we consider an approximate greedy, "online" approach to optimize $E_o$. The idea is to deform the initial region $R_{t|t}$ by an infinitesimal velocity optimizing the energy $E_o$, transport the appearance $a_{t|t}$ by the velocity, and then iterate the process with the transported appearance until an accurate match is obtained.

Let $\tau$ denote an artificial time variable parameterizing the iterative process described earlier, $R_\tau$ denote the warped region at time $\tau$, and $a_\tau$ (defined on $R_\tau$) the transported appearance at time $\tau$. The approximate optimization scheme is the following set of coupled PDEs:

$$(v_\tau, O_\tau) = \arg\min_{v, O} E_{o,approx}(v, O; a_\tau, R_\tau) \qquad (12)$$

$$\partial_\tau R_\tau = v_\tau, \quad R_0 = R_{t|t} \qquad (13)$$

$$\partial_\tau a_\tau(x) = -\nabla a_\tau(x) \cdot v_\tau(x), \quad x \in R_\tau, \quad a_0 = a_{t|t} \qquad (14)$$

where $\partial_\tau$ denotes the partial derivative with respect to $\tau$, the second equation implies a warp of the region $R_\tau$ infinitesimally by the velocity $v_\tau$, and the third equation is a transport equation, which "transports" the appearance from an infinitesimal time earlier to the current region $R_\tau$. The energy above is defined as

$$E_{o,approx}(v, O; a, R) = \int_{R \setminus O} |I_{t+1}(x) - a(x) - \nabla a(x) \cdot v(x)|^2 \, dx + \int_R |\nabla v(x)|^2 \, dx + \mathrm{Area}(O) + \mathrm{Len}(O), \qquad (15)$$

which represents a linearization of a term of $E_o$; this is valid since the velocity $v_\tau$ is assumed small. The optimization of $E_{o,approx}$ is done using an alternating minimization in $v$ and $O$, akin to outlier rejection via a procedure analogous to RANSAC. The optimization can be done with steepest descent or even relaxed to a convex optimization problem; however, our simple algorithm below yields effective results. We summarize the algorithm in the following steps:

1) Set $k = 0$ and $O_k = \emptyset$.

2) Optimize $E_{o,approx}(v, O_k; a, R)$ in $v$. The optimizing equation, using basic variational techniques, is

$$-\lambda \Delta v_k(x) = \begin{cases} W(x) \left( I_{t+1}(x) - a(x) - \nabla a(x) \cdot v_k(x) \right) \nabla a(x) & x \in R \setminus O_k \\ 0 & x \in O_k \end{cases} \qquad (16)$$

which can be solved rapidly with a conjugate gradient solver (see Appendix A). The parameter $\lambda > 0$ is a weighting on the regularizer of $v$ in (15).

3) Optimize $E_{o,approx}(v_k, O; a, R)$ in $O$, i.e., set

$$\mathrm{Res}(x; a, v_k) = |I_{t+1}(x) - a(x) - \nabla a(x) \cdot v_k(x)|^2, \quad x \in R \qquad (17)$$

$$O_{k+1} = \{x \in R : (G_\sigma * \mathrm{Res})(x) > T_o\} \qquad (18)$$

where $T_o > 0$ is a threshold, equivalent to a weight on the area term, $G_\sigma$ represents a (radially symmetric) smoothing kernel, and the parameter $\sigma$ controls the amount of spatial regularity desired (approximately equivalent to the effect that the weight on the length term has on spatial regularity). See below for the justification that this definition is an approximation to the global optimum of $E_{o,approx}$ in $O$.

4) Increment $k$ and repeat Steps 2-3 until the energy does not decrease.
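Step 3 above (compute the linearized residual, smooth it, and threshold it) can be sketched as follows. This is only an illustration under stated assumptions: the smoothing kernel is taken to be a truncated Gaussian implemented with plain NumPy, the image gradient and velocity are passed as `(x, y)` component pairs, and the names `gaussian_smooth` and `occlusion_from_residual` are hypothetical.

```python
import numpy as np

def gaussian_smooth(field, sigma):
    """Separable Gaussian smoothing with a truncated, normalized kernel."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()

    def smooth_1d(line):
        # Edge padding keeps the output the same length as the input.
        padded = np.pad(line, radius, mode="edge")
        return np.convolve(padded, kernel, mode="valid")

    rows_done = np.apply_along_axis(smooth_1d, 1, field)
    return np.apply_along_axis(smooth_1d, 0, rows_done)

def occlusion_from_residual(image, appearance, grad_a, velocity,
                            region_mask, sigma, threshold):
    """Occlusion estimate following (17)-(18): smooth, then threshold."""
    grad_x, grad_y = grad_a
    vel_x, vel_y = velocity
    linearized = appearance + grad_x * vel_x + grad_y * vel_y
    residual = (image - linearized) ** 2              # (17)
    smoothed = gaussian_smooth(residual, sigma)
    return region_mask & (smoothed > threshold)        # (18)

# Toy example: the template is flat, but a bright patch appears in the image.
image = np.zeros((20, 20))
image[7:13, 7:13] = 10.0
appearance = np.zeros((20, 20))
zero = np.zeros((20, 20))
region = np.ones((20, 20), dtype=bool)
occluded = occlusion_from_residual(image, appearance, (zero, zero),
                                   (zero, zero), region,
                                   sigma=1.0, threshold=25.0)
```

Pixels where the template cannot explain the image, even after the linearized motion correction, end up in the occlusion estimate, and the smoothing suppresses isolated outliers.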

That the optimization of $E_{o,approx}(v, O; a, R)$ with respect to $O$ can be approximated by (18) is based on the following simple argument. Minimizing $E_{o,approx}(v, O; a, R)$ in $O$ is equivalent to optimizing the energy

$$E(O) = \int_O \left( -\mathrm{Res}(x; a, v_k) + T_o \right) dx + \mathrm{Length}(\partial O) \qquad (19)$$

where $T_o$ is a weighting on the area term. Ignoring the length term for the moment, it is easy to see that a global minimum of the energy above in this special case is

$$O_{k+1} = \{x \in R : -\mathrm{Res}(x; a, v_k) + T_o < 0\} = \{x \in R : \mathrm{Res}(x; a, v_k) > T_o\}. \qquad (20)$$

The length term maintains spatial regularity of $O_k$. Spatial regularity can also be maintained if $\mathrm{Res}(x; a, v_k)$ is spatially smooth, which can be achieved by smoothing $\mathrm{Res}(x; a, v_k)$; hence (18). Note that smoothing and then thresholding is an approximation to a non-linear diffusion term related to the length term, and can be justified to converge to the true solution in the limit (see, for example, ideas in [ ], [ ]).

Note that according to (12)-(14), although the occlusion is approximated at each artificial time $\tau$, it is not removed from the region $R_\tau$ (at this stage). The appearance and region are transported in the occluded region by the velocity $v_\tau$, which is interpolated from the un-occluded region via (16). As the velocity does not physically exist in the occluded region, this is a desirable property. The scheme has the effects of robust norms [ ], and in addition explicitly gives an estimate of the occluded region. It should be noted that the PDEs (13) and (14) are transport PDEs and require an upwinding difference scheme (see Appendix A for discrete implementation details).

The evolution in $\tau$ of (12)-(14) is stopped when the energy $E_{o,approx}(v_\tau, O_\tau; a_\tau, R_\tau)$ has converged. At this time, $\tau_\infty$, the region $R_{\tau_\infty}$ is the portion of the object's projection including a hallucinated propagation of the occlusion $O_t$ and excluding the disoccluded region $D_t$. The region excluding the occlusion is now:

$$R'_{t+1} = R_{\tau_\infty} \setminus \{x \in R_{\tau_\infty} : (G_\sigma * \mathrm{Res}(\cdot; a_{\tau_\infty}, v_{\tau_\infty}))(x) > T_o\}. \qquad (21)$$

The propagated appearance $a_{t+1|t} \circ w_t^{-1}$ is $a_{\tau_\infty}$ in $R'_{t+1}$. The appearance in the occlusion region is an interpolation of values along the velocity; it does not represent the physical appearance of the object, and is eliminated anyway when the occlusion is removed.

Note that due to the large displacement of the object between frames, it is desirable to perform the optimization in a coarse-to-fine manner. Thus, we start the optimization above by restricting $v_\tau$ to be affine motions until convergence, and then switch to a full non-parametric $v_\tau$. To continue the coarse-to-fine approach, at initial time $\tau$ the regularity parameter $\lambda$ is chosen large until convergence of the energy $E_{o,approx}$; then $\lambda$ is lowered (to capture finer details) at larger time $\tau$, and the process is iterated.



B. Optimization of $E_d$

The addition of the disoccluded region $D_t$ and the determination of the estimate $R_{t+1|t+1}$ is rather simple. The minimum of the energy $E_d$ can be approximated very fast and accurately in one thresholding step, once the likelihood $p_{B_r(\mathrm{cl}(x))}(I_{t+1}(x), x)$ is computed. Since the likelihood decreases exponentially with distance to $R'_{t+1}$, we assume that $D_t \subset \{0 < d_{R'_{t+1}} < \varepsilon\} = B_{R'_{t+1}}(\varepsilon)$, where $d_{R'_{t+1}}$ indicates the distance function to $R'_{t+1}$. The disocclusion is obtained in a simple step (this approximation to the global optimum has a similar justification to the minimization of $E_{o,approx}$ in $O_k$):

$$D_t = \{x \in B_{R'_{t+1}}(\varepsilon) : (G_\sigma * p_{B_r(\mathrm{cl}(\cdot))})(x) > T_d\} \qquad (22)$$

where $\sigma$ is equivalent to the weight on the length term, and a smaller threshold $T_d$ indicates a preference for a larger $D_t$. The choice of $T_d$ is based on the frame rate of the camera and the speed of motion of the object (the higher the speed and the lower the frame rate, the larger $T_d$). Finally, we have that

$$R_{t+1|t+1} = R'_{t+1} \cup D_t. \qquad (23)$$

The distance $d_{R'_{t+1}}$ in $B_{R'_{t+1}}(\varepsilon)$ can be computed very rapidly with the Fast Marching Method [ ], and $\mathrm{cl}(x)$ at each point of $B_{R'_{t+1}}(\varepsilon)$ can simultaneously be propagated via a transport equation as the front in the Fast Marching Method evolves, which is also rapid to compute. The likelihood $p_{B_r(\mathrm{cl}(x))}(I_{t+1}(x), x)$ is then readily computed with $\mathrm{cl}(x)$ already available.
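The thresholding step (22) and the union (23) can be illustrated on a small grid. In this sketch the distance to the region is computed by brute force, which is a simple stand-in for the Fast Marching Method named above and is only practical for small images; the likelihood is assumed to be already smoothed and normalized to $[0, 1]$; and the function names are hypothetical.

```python
import numpy as np

def distance_to_region(region_mask):
    """Brute-force Euclidean distance from each pixel to the nearest region pixel.

    Stand-in for a Fast Marching computation; assumes a non-empty region.
    """
    rows, cols = np.indices(region_mask.shape)
    pixels = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    inside = pixels[region_mask.ravel()]
    diffs = pixels[:, None, :] - inside[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    return nearest.reshape(region_mask.shape)

def add_disocclusion(region_mask, smoothed_likelihood, band_width, threshold):
    """Return (D_t, updated region) following (22) and (23)."""
    dist = distance_to_region(region_mask)
    band = (dist > 0) & (dist < band_width)            # candidate band
    disoccluded = band & (smoothed_likelihood > threshold)
    return disoccluded, region_mask | disoccluded

region = np.zeros((10, 10), dtype=bool)
region[4:6, 2:5] = True
likelihood = np.zeros((10, 10))
likelihood[4:6, 5:7] = 0.9     # likely disocclusion next to the region
likelihood[0, 9] = 0.9         # high likelihood, but outside the band
disoccluded, updated = add_disocclusion(region, likelihood,
                                        band_width=3.0, threshold=0.5)
```

Restricting the test to the band means that background pixels far from the region are never added, regardless of their likelihood value.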

Figure 3 shows all of the steps involved in optimization of the energies Eo and Ed. 
(a) Initial appearance $a_{t|t}$ and region $R_{t|t}$ (left), and next image $I_{t+1}$. As the man swings his arms to walk forward, new parts of the shirt and pants appear and part of his arm disappears.

(b) Warped appearance and region after optimization with respect to the global affine motion for optimizing $E_o$ (left), and the warped region displayed in $I_{t+1}$.

(d) Snapshots of the evolution of minimizing $E_o$ (left to right, top to bottom). The current estimated region, the velocity (in standard optical flow color code), the warped template, and the occluded region estimates (in white) are displayed. The last three images on the bottom row show the final occluded region, the appearance $a_{t+1|t} \circ w_t^{-1}$, and the final un-occluded region after the occlusion is removed.

(e) The likelihood function $p_{B_r(\mathrm{cl}(\cdot))}$, the detected disocclusion, and the final estimates $R_{t+1|t+1}$ and $a_{t+1|t+1}$.

Fig. 3: This figure illustrates all of the steps in determining the self-occlusion, warp, and the self-disocclusion given the previous estimates $a_{t|t}$, $R_{t|t}$ and the image $I_{t+1}$. Although the detected occlusion and disocclusion are small sets that seemingly can be ignored, over time the occluded and disoccluded areas become larger, and modeling these phenomena is essential (see Fig. 4).
V. Experiments

In this section (see footnote), we demonstrate our method on challenging imagery displaying complex dynamic background, large deformation and articulation of the object, changes of viewpoint, self-occlusions and disocclusions, and even violations of brightness constancy. The videos that we have chosen are of objects that appear close to the camera (i.e., near field). This is in contrast to typical videos that are tested in many computer vision tasks (e.g., pedestrian tracking) where the object appears far from the camera, and precise shape is irrelevant. We segment the initial frame by hand to obtain the initial template, and the rest is automatically obtained by the proposed method. Although our primary concern in this paper is modeling self-occlusions, shape, and appearance of objects (a fundamental problem in computer vision), we do show that the model can be used to obtain state-of-the-art results in obtaining the precise shape of the object in video. We compare the proposed method to a state-of-the-art method [19], [20] for obtaining precise cutouts from video. We obtained the results of [ ] from its implementation in the latest version of Adobe After Effects (CS6, 2012). We are not aware of any method in the literature that tracks by modeling and detecting self-occlusions and self-disocclusions, and we thus compare to [ ] even though the method does not explicitly consider self-occlusion phenomena (compared to other methods in the literature, this method obtains the best results in terms of precise shape determination in object tracking). Because the method has been tested heavily in an industrial setting, and is industry state-of-the-art, it is a natural choice for comparison.

In all the experiments below (unless specified otherwise), the parameters are chosen as follows (based on a frame rate of 30 fps and HD 720 quality video). The smoothness parameter arising from the length terms in both $E_{o,approx}$ and $E_d$ is $\sigma = 5$, and the parameter $\sigma_d = 100$ in the definition of $p_{B_r}$. The thickness of the band, $B_{R'_{t+1}}(\varepsilon)$, is $\varepsilon = 15$. The radius of the window $B_r$ in the computation of $p_{in}$ and $p_{out}$ in the disocclusion stage is $r = 20$ (i.e., a 40 x 40 window). The threshold for the occlusion stage is $T_o = \mathrm{Res}_{min} + 0.3 \times (\mathrm{Res}_{max} - \mathrm{Res}_{min})$, where $\mathrm{Res}_{max}$ ($\mathrm{Res}_{min}$) denotes the maximum (minimum) value of the smoothed residual. The threshold for the disocclusion stage is $T_d = 0.5$ when $p_{B_r}$ is normalized between 0 and 1. The gain in the appearance update (11) is $K_a = 0.7$.
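The two threshold rules stated above are simple enough to write down directly. The following sketch computes the occlusion threshold from a smoothed residual map and rescales a likelihood map to $[0, 1]$ so that a disocclusion threshold of 0.5 is meaningful; the function names are hypothetical, and the handling of a constant likelihood map is an assumption made only to avoid division by zero.

```python
import numpy as np

def occlusion_threshold(smoothed_residual, fraction=0.3):
    """T_o = Res_min + fraction * (Res_max - Res_min)."""
    res_min = float(np.min(smoothed_residual))
    res_max = float(np.max(smoothed_residual))
    return res_min + fraction * (res_max - res_min)

def normalize_unit_range(likelihood):
    """Rescale a likelihood map to [0, 1] before thresholding at T_d."""
    values = np.asarray(likelihood, dtype=float)
    low, high = values.min(), values.max()
    if high == low:
        # Assumption: a constant map carries no evidence, so return zeros.
        return np.zeros_like(values)
    return (values - low) / (high - low)

t_o = occlusion_threshold(np.array([2.0, 4.0, 12.0]))
unit_likelihood = normalize_unit_range([1.0, 3.0, 5.0])
```

Because both thresholds are defined relative to the range of the data, they adapt to the overall contrast of each frame rather than being fixed intensity values.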

First, in Figure 4, we demonstrate the necessity for modeling both occlusions and disocclusions of the object 
of interest. The images are cropped to the area of interest for display although the whole frames are processed. 
The video displays a moving camera and changes of viewpoint (hence changing background). As the man in the 
sequence walks forward, his legs self-occlude and self-disocclude; his arm self-occludes his body, and one of his 
hands disoccludes from the body. Also, due to viewpoint change of the camera, occlusions and disocclusions arise. 
Due to illumination and specularities (the face and skin, and even clothes), there are deviations from brightness 
constancy (and this is handled quite well with our dynamic appearance model). Also, the face is the same color 
as parts of the background, which adds to the complexity of this sequence. Note that in the top row using just the 
dynamic model without any consideration of occlusions or disocclusions, the estimated region quickly becomes 
inaccurate as the occluded region continues to be part of the estimate, and disoccluded regions fail to be updated. 
In the second row, using the model of self-occlusion / occlusion, it is now possible to model discontinuity of the 
warp due to self-occlusion (e.g., between the legs), and therefore, the method is able to discard the portion of the 
background between the legs. The occluded right hand in the second frame is correctly ignored. In the third row, using just the dynamic model with the disocclusion modeling but without the occlusion modeling, much of the disoccluded parts of the body are detected. However, the warp cannot model discontinuities arising from self-occlusions (between the legs and also near the disocclusion of the second hand), and this leads to capturing irrelevant regions of the background. Best results (second row from the bottom) are achieved when both the occlusions (also modeling discontinuities of the warp) and disocclusions are modeled. The last row shows the result of [ ]. Self-occlusions of the legs and of the arm over the body clearly present problems for the algorithm [ ] (as self-occlusion phenomena are not modeled), and thus it is apparent that occlusion phenomena must be modeled. Further, [ ] seems to have trouble when background and foreground share similar intensity, and irrelevant parts of the background are captured.

In Figure 5, we test our method on a challenging sequence that displays significant articulation, self-occlusions, and a moving camera and background. The sequence displays changes of lighting on the lady of interest, as movement with respect to the light source changes the illumination on the person. The clothing and hair of the woman are not entirely Lambertian, and this adds to the difficulty of the sequence. Such deviations from brightness constancy are
^Videos of experiments performed are available online at http://vision.ucla.edu/~ganeshs/articulated_object_tracking_html/ObjectTrackingSelfOcclusions.html







Fig. 4: This figure shows the need to explicitly model occlusions and disocclusions. Left to right: frames 1, 10, . . ., 
100. When the occlusion and disocclusion detection are turned off (top), a deformation of the current appearance 
/ shape is inadequate to capture the shape / appearance in the next frame, and the errors propagate quickly over 
time. In the second and third row, occlusion and disocclusion, resp., modeling is turned on, and this leads to 
improvements. Fully accurate tracking is achieved when both occlusion and disocclusion modeling is performed 
(fourth row). The last row shows comparison with a state-of-the-art method [ ] for object tracking, which does 
not explicitly model occlusion phenomena (we are not aware of another method that models self-occlusions). The 
comparison shows the advantage of modeling occlusions/disocclusions in our or any other tracking framework. 



handled by the filtering of the appearance of the template in time (11). Further, there are several locations where the appearance of the lady is similar to the background (e.g., the pants of the woman and the tires of the car, and near the end of the sequence where the background appearance is indistinguishable from the shirt of the lady). It is apparent that the proposed method obtains the shape of the lady with high accuracy, and topological changes of the projected object in the imaging plane are handled without special treatment. In contrast, the method [ ] cannot handle self-occlusions correctly, is distracted by the background, and over-segments the object. We note that midway through the sequence, we reinitialized [ ] by hand (this was not done for the proposed method).

In Figure 6, we test the proposed method on tracking a man holding a briefcase and running at a train station. 
The sequence displays large deformation, self-occlusion and self-disocclusion, and large deviation from Brightness 
constancy due to illumination change and shadows. The proposed method gives highly accurate detections of the 
man, and is easily able to handle self-occlusions and disocclusions. In comparison, the method of [19] is unable 
Fig. 5: This figure shows the proposed method (including both occlusion and disocclusion stages) (top two rows) on
a challenging sequence that has many self-occlusions, a moving camera, areas where the object and the background 
are identical, changes of illumination, and specularities. Note that although changes of illumination and specularities 
are not explicitly modeled (and therefore modeled as noise), the dynamic filtering of the appearance (11) allows our 
method to be rather robust to such unmodeled phenomena. The bottom two rows show comparison to the method 
[19]; note that the method has been reinitialized in the bottom row. 



to handle occlusions and disocclusions, and once again over-segments the object and is trapped by parts of the background. While the results of the proposed method are highly accurate, a frame near the end of the sequence (second-to-last frame in the second row) shows a limitation of our assumptions for detecting disocclusions. The sole of the man's shoe has been disoccluded; however, its appearance is not similar to nearby parts of the man, and thus the sole of the shoe is not detected as a disocclusion.

Figure 7 shows the results of the proposed method on tracking a fish in an underwater sequence. The sequence 
displays dynamic and cluttered background, significant viewpoint change of the fish with respect to the camera, 
minor self-occlusion and disocclusion, and changes of illumination due to change of position of the light source. 
The proposed method obtains highly accurate results, and the method in [ ] also obtains accurate results. Notice 
the finer details that are captured well with the proposed method (e.g., the nose of the fish - top row - last frame, 
and topological changes due to self-occlusion and self-disocclusion - second row - first frame). 

We give a quantitative analysis of the performance of our object tracking algorithm in Figure 8. We analyze the accuracy of the region, which represents the projected object, obtained from our algorithm. To facilitate the analysis, we have hand-segmented all frames of all four sequences, and used the hand segmentation as ground truth. Note that ground truth obtained in this way may not be fully accurate since in much of the video there are faint boundaries between the object and background, and hand drawing introduces errors (and as it is time-consuming, fewer sample points of the boundary of the object are used, resulting in inaccurate interpolation of the boundary). Ideally, we would like to measure the performance of our algorithm in determining occlusions and disocclusions, but obtaining ground truth for occlusions and disocclusions is significantly harder. Obtaining an overall accurate segmentation of the object suggests that the occlusion and disocclusion estimates are likely to have been accurate, and in many applications the shape of the object is all that is desired. We measure the accuracy of our regions with three measures: the precision, recall, and F-measure of the pixels of the object. These measures are defined as follows:

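$$\mathrm{Pr} = \frac{|R_{\text{alg}} \cap R_{\text{ground truth}}|}{|R_{\text{alg}}|}, \quad \mathrm{Rc} = \frac{|R_{\text{alg}} \cap R_{\text{ground truth}}|}{|R_{\text{ground truth}}|}, \quad F = \frac{2\, \mathrm{Pr} \cdot \mathrm{Rc}}{\mathrm{Pr} + \mathrm{Rc}} \qquad (24)$$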
Fig. 6: This figure shows the results of the proposed method on a challenging sequence displaying significant deformation, articulation, self-occlusions, and self-disocclusions. Notice that the shadows and illumination change on the man (due to the sun) defy brightness constancy. The filtering on the appearance in the proposed method is able to adapt to violations of brightness constancy. The top two rows show the results of the proposed method, and the bottom two rows show the results of [ ].




Fig. 7: This figure shows the results of the proposed method on an underwater sequence of a fish taken from 
the Red Sea. The sequence illustrates deformations due to viewpoint change and small self-occlusions and self- 
disocclusions. While the proposed method (top two rows) and [ ] (bottom two rows) achieve similar results, a 
closer inspection reveals that occlusions (e.g., near the tail) and appearance changes (near the nose) are handled 
more precisely with the proposed method. 
[Figure 8 panel title: Precision/Recall vs. Frame No. for Lib sequence]

Fig. 8: This figure shows quantitative evaluation of the proposed method on the four image sequences displayed in Figures 4, 5, 6, 7. The curves show the precision, recall, and F-measure (compared to ground truth hand segmentation) versus frame number in the sequence. The solid curves indicate the results of the proposed method, and the dashed curves indicate the results of the method [ ]. Notice that precision, recall, and F-measure of the proposed method are high. The method of [ ] achieves high recall by over-segmentation, thus compromising significantly on precision. Note that the sudden increase of the dashed blue and green curves in the plot for the "Lady sequence" is due to hand reinitialization of the result given by [ ].



where $R_{\text{alg}}$ is the region detected by the algorithm at a given frame, and $|\cdot|$ indicates the area of the set. High precision indicates that most of the detected pixels are part of the true object, high recall indicates that most of the pixels belonging to the object have been retrieved, and a high F-measure indicates that both precision and recall are high. Figure 8 shows the precision, recall, and F-measure versus frame number for the proposed method and the method [ ] for all of the sequences tracked. Notice that the proposed method consistently obtains both high precision and high recall (mostly above 95% on all frames on all sequences). On the other hand, [ ] obtains high recall by over-segmenting the desired object, thus compromising heavily on precision. Indeed, the F-measure of our method is uniformly higher on all sequences.
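For concreteness, the three measures in (24) can be computed from a pair of binary masks as in the following sketch. The function name `precision_recall_f` is hypothetical, and returning zero when a denominator vanishes is a convention chosen here, not something the text specifies.

```python
import numpy as np

def precision_recall_f(mask_alg, mask_ground_truth):
    """Pixel-wise precision, recall, and F-measure as defined in (24)."""
    mask_alg = np.asarray(mask_alg, dtype=bool)
    mask_ground_truth = np.asarray(mask_ground_truth, dtype=bool)
    overlap = np.count_nonzero(mask_alg & mask_ground_truth)
    area_alg = np.count_nonzero(mask_alg)
    area_gt = np.count_nonzero(mask_ground_truth)
    # Convention (assumption): empty denominators yield a score of zero.
    precision = overlap / area_alg if area_alg else 0.0
    recall = overlap / area_gt if area_gt else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# An over-segmented result: every true pixel is found, plus one extra pixel.
detected = [[1, 1], [1, 0]]
truth = [[1, 1], [0, 0]]
precision, recall, f_measure = precision_recall_f(detected, truth)
```

The toy example mirrors the over-segmentation behavior discussed above: recall is perfect while precision drops, and the F-measure penalizes the imbalance.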

In Figure 9, we show a quantitative analysis of the sensitivity of the proposed method to its key parameters. We analyze the sensitivity to the thresholds $T_o$ and $T_d$ in the occlusion and disocclusion detection stages using a precision/recall (PR) curve. For each image sequence, we choose a pair of images so that significant occlusion and disocclusion are present between the frames, and significant deformation and motion are present. Typically, the pair is separated by 5 frames on these sequences, which have a frame rate of 30 frames per second. Given a hand cutout in the first frame, we run our algorithm (both occlusion and disocclusion stages) to obtain the cutout in the next frame. The first image in Figure 9 shows the PR curve as the parameter $T_o$ is varied over its valid range (between the minimum value of the residual, Res, and its maximum value), and the threshold of the disocclusion stage $T_d$ is kept fixed. The second image in Figure 9 shows the PR curve as the parameter $T_d$ in the disocclusion stage is varied over its valid range (between the minimum and maximum value of $p_{B_r}$), and the threshold in the occlusion stage $T_o$ is kept fixed. Note that high precision and recall are maintained for a wide range of thresholds. As we do not have access to the source code of [ ], we could not test the sensitivity of the parameters of that algorithm.

Lastly, we state the approximate running time of our algorithm on a standard Intel 2.8GHz dual core processor. Note that the speed will depend on a variety of factors such as the size of the object and amount of deformation between frames. On HD 720 quality video, it is generally about 10 seconds per frame for the "Lady" and "Station" sequences (implemented in C++). Note that our main focus in this research was formulating the challenging problems
[Figure 9 panel titles: Precision vs. Recall Curve (occlusion stage); Precision vs. Recall Curve (disocclusion stage)]

Fig. 9: The figure shows a quantitative assessment of the sensitivity of the key parameters (i.e., thresholds), $T_o$ and $T_d$, of the proposed algorithm for the occlusion and disocclusion stages in the sequences above. The Precision/Recall curves indicate that the algorithm is robust to the choice of thresholds: a wide range of thresholds results in high values of precision and recall.



of self-occlusion, shape and appearance, and not designing the most optimized code. A number of speed-ups are possible. For example, the main bottleneck is the joint optical flow computation and region computation, which can be sped up using a multiscale procedure. Further, GPUs are widely available, and the method would likely run much faster on such a processor.

VI. Conclusion 

We have presented a method for video object tracking when precise shape of the projected object is needed. The method applies to objects that have complex appearance and undergo large deformation and articulation in near-field video, video where the camera moves and changes viewpoint with respect to the object, and video with complex dynamic background. The technique is an adaptive template matching scheme that is implemented as a non-linear observer, and is based on the principle that object tracking should be performed by identifying stationary statistics of the object appearance and shape across time. The significant aspect in such a technique is that the shape of the object projected in the imaging plane changes because of viewpoint change, 3D articulation, and self-occlusions and self-disocclusions. We have derived a model for the change of the shape and appearance of the object in time taking into account various types of occlusions, and derived a computational method to estimate the shape and appearance of the object at each time. We have shown that modeling self-occlusions is fundamental to the success of object tracking. We have tested our algorithm on complex video, and shown that our algorithm is able to precisely determine the shape of the object of interest, achieving state-of-the-art performance on the sequences tested. Future work includes extending the model to occlusion of the object by other objects, extending the algorithm to handle more complex illumination changes, and exploiting other types of assumptions in order to determine self-disocclusions.

Appendix 

A. Numerical Methods 

We now describe the numerical implementation of the region and appearance evolution in the warping and occlusion stage (13), (14). The evolution of the region in (13) is done using Level Set Methods [ ] to conveniently handle topological changes and for numerical accuracy (although other methods may be used). Let $\Psi_\tau : \Omega \to \mathbb{R}$ denote the level set function at time $\tau$ such that $\{x \in \Omega : \Psi_\tau(x) < 0\} = R_\tau$. The evolution equation is the following transport equation:

$$\partial_\tau \Psi_\tau(x) = -\nabla \Psi_\tau(x) \cdot v_\tau(x), \quad x \in \partial R_\tau, \qquad (25)$$

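which is discretized using an upwinding difference scheme:

$$\Psi_{\tau_{i+1}}(x) = \Psi_{\tau_i}(x) - \Delta t \left( v^1_{\tau_i}(x)\, D_{x_1}[\Psi_{\tau_i}, v^1_{\tau_i}, x] + v^2_{\tau_i}(x)\, D_{x_2}[\Psi_{\tau_i}, v^2_{\tau_i}, x] \right) \qquad (26)$$

where $\Delta t > 0$ is the time step, and

$$D_{x_j}[\Psi_{\tau_i}, v^j_{\tau_i}, x] = \begin{cases} D^+_{x_j} \Psi_{\tau_i}(x) & \text{if } v^j_{\tau_i}(x) < 0 \\ D^-_{x_j} \Psi_{\tau_i}(x) & \text{if } v^j_{\tau_i}(x) > 0 \end{cases} \qquad (27)$$

where $D^+_{x_j}$ ($D^-_{x_j}$) denotes the forward (backward, resp.) difference with respect to the $x_j$ coordinate, and $v_\tau(x) = (v^1_\tau(x), v^2_\tau(x))$. Note that $v_\tau|_{\partial R_\tau}$ is extended to the narrowband of the level set function by choosing the velocity in the narrowband to be the same velocity as the closest point on $\partial R_\tau$.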

After each iteration of (26), the appearance $a_{\tau_i}$ is updated by a discretization of (14):

$$a_{\tau_{i+1}}(x) = \begin{cases} a_{\tau_i}(x) - \Delta t \left( v^1_{\tau_i}(x)\, D_{x_1}[a_{\tau_i}, v^1_{\tau_i}, x] + v^2_{\tau_i}(x)\, D_{x_2}[a_{\tau_i}, v^2_{\tau_i}, x] \right) & x \in R_{\tau_{i+1}} \cap R_{\tau_i} \\ \sum_{y \in N_x \cap R_{\tau_i}} d_{\tau_i}(x, y)\, a_{\tau_i}(y) & x \in R_{\tau_{i+1}} \setminus R_{\tau_i} \end{cases} \qquad (28)$$

where $N_x$ denotes the eight neighbors of $x$, and $d_{\tau_i}(x, y)$ denotes the distance between $x$ and the zero crossing of the level set between $x$ and $y$ (zero if there is no zero crossing). In the computation of the forward/backward difference, if the relevant neighbor of $x$ is not in $R_{\tau_i}$, then the difference is set to zero. It should be noted that the step size is chosen to satisfy stability criteria, which means that the level set may not move more than one pixel; thus $x$ will always have a neighbor that is in $R_{\tau_i}$, and so the second case in (28) is well-defined. The step size $\Delta t$ is chosen to satisfy $\Delta t < 0.5 / \max_{x \in R_{\tau_i},\, j = 1, 2} |v^j_{\tau_i}(x)|$.
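The upwind update (26)-(27) and the step-size bound can be sketched on a full grid as follows. This is a simplified illustration: it updates every pixel rather than a narrowband, treats the first velocity component as acting along image columns, uses replicated values at the image border, and applies a safety factor of 0.9 to stay strictly below the stated bound. All of these choices, and the function names, are assumptions.

```python
import numpy as np

def stable_time_step(vx, vy, bound=0.5, safety=0.9):
    """Time step strictly below bound / max|v|, as required for stability."""
    vmax = max(np.abs(vx).max(), np.abs(vy).max())
    return safety * bound / vmax if vmax > 0 else 1.0

def upwind_step(phi, vx, vy, dt):
    """One explicit upwind update of d(phi)/d(tau) = -grad(phi) . v."""
    padded = np.pad(phi, 1, mode="edge")
    forward_x = padded[1:-1, 2:] - phi       # D+ along columns
    backward_x = phi - padded[1:-1, :-2]     # D- along columns
    forward_y = padded[2:, 1:-1] - phi       # D+ along rows
    backward_y = phi - padded[:-2, 1:-1]     # D- along rows
    # Upwinding: look backward where the velocity is positive, else forward.
    dx = np.where(vx > 0, backward_x, forward_x)
    dy = np.where(vy > 0, backward_y, forward_y)
    return phi - dt * (vx * dx + vy * dy)

# Region {phi < 0} is the left part of the grid; move it to the right.
columns = np.arange(10, dtype=float)
phi = np.tile(columns - 5.0, (6, 1))
vx = np.ones_like(phi)
vy = np.zeros_like(phi)
dt = stable_time_step(vx, vy)
phi_next = upwind_step(phi, vx, vy, dt)
```

In this toy example the zero level set moves to the right by the velocity times the time step, so the region gains less than one pixel per iteration, consistent with the stability remark above.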

To solve (16), a conjugate gradient solver is ideal as the system is symmetric and positive definite. To use the conjugate gradient method in a discrete setting, one needs to write the discretized equation as a linear system $A v_k = b$. We see that

$$A v_k(x) = -\lambda \sum_{y \in N_x \cap R_{\tau_i}} (v_k(y) - v_k(x)) + 1_{R_{\tau_i} \setminus O_k}(x)\, W(x)\, \nabla a_{\tau_i}(x) \nabla a_{\tau_i}(x)^T v_k(x), \quad x \in R_{\tau_i} \qquad (29)$$

and

$$b(x) = W(x) \left( I_{t+1}(x) - a_{\tau_i}(x) \right) \nabla a_{\tau_i}(x)\, 1_{R_{\tau_i} \setminus O_k}(x), \quad x \in R_{\tau_i} \qquad (30)$$

where $1_{R_{\tau_i} \setminus O_k}$ is the indicator function of $R_{\tau_i} \setminus O_k$, $\nabla a_{\tau_i}(x)$ uses central differences and a Neumann boundary condition, and $N_x$ here denotes the four neighbors of $x$. With these definitions of $A$ and $b$, the conjugate gradient method can be used. Note that after the first iteration in $i$, the conjugate gradient method converges in a few iterations if it is initialized with the final velocity determined from iteration $i - 1$ (where there is overlap with $R_{\tau_{i-1}}$), as the region $R_{\tau_i}$ changes only by a few pixels between iterations in $i$.
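A matrix-free conjugate gradient solver of the kind referred to above can be sketched in a few lines. To keep the example small, the operator below is a one-dimensional scalar analogue of (29): a graph Laplacian on a chain of pixels plus a non-negative diagonal data term. The reduction to one dimension, the toy coefficients, and the function names are assumptions made for illustration; the `x0` argument corresponds to warm-starting from the previous iteration's velocity.

```python
import numpy as np

def conjugate_gradient(apply_A, b, x0=None, tol=1e-10, max_iter=500):
    """Solve A x = b for a symmetric positive definite operator `apply_A`."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.array(x0, dtype=float)
    r = b - apply_A(x)
    p = r.copy()
    rs_old = float(np.vdot(r, r))
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Ap = apply_A(p)
        alpha = rs_old / float(np.vdot(p, Ap))
        x += alpha * p
        r -= alpha * Ap
        rs_new = float(np.vdot(r, r))
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def make_chain_operator(data_weight, lam):
    """1-D analogue of (29): -lam * (neighbor differences) + data_weight * v."""
    def apply_A(v):
        laplacian = np.zeros_like(v)
        laplacian[1:] += v[:-1] - v[1:]      # left neighbor, where it exists
        laplacian[:-1] += v[1:] - v[:-1]     # right neighbor, where it exists
        return -lam * laplacian + data_weight * v
    return apply_A

# Zero data weight mimics occluded pixels, where only smoothness acts.
data_weight = np.array([1.0, 0.0, 2.0, 0.0, 0.5])
apply_A = make_chain_operator(data_weight, lam=3.0)
v_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
b = apply_A(v_true)
v_solved = conjugate_gradient(apply_A, b)
```

The operator is never assembled as a matrix, which is convenient when the region changes slightly at every iteration and only the neighbor structure needs updating.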